

Gradient-based learning drives robust representations in recurrent neural networks by balancing compression and expansion

An Author Correction to this article was published on 19 October 2022

Abstract

Neural networks need the right representations of input data to learn. Here we ask how gradient-based learning shapes a fundamental property of representations in recurrent neural networks (RNNs)—their dimensionality. Through simulations and mathematical analysis, we show how gradient descent can lead RNNs to compress the dimensionality of their representations in a way that matches task demands during training while supporting generalization to unseen examples. This can require an expansion of dimensionality in early timesteps and compression in later ones, and strongly chaotic RNNs appear particularly adept at learning this balance. Beyond helping to elucidate the power of appropriately initialized artificial RNNs, this fact has implications for neurobiology as well. Neural circuits in the brain reveal both high variability associated with chaos and low-dimensional dynamical structures. Taken together, our findings show how simple gradient-based learning rules lead neural networks to solve tasks with robust representations that generalize to new cases.


Fig. 1: Task and model schematic.
Fig. 2: Dynamical and geometric properties of networks learning to classify high-dimensional inputs.
Fig. 3: Dynamical and geometric properties of networks learning to classify two-dimensional inputs.
Fig. 4: Dynamical and geometric properties of networks learning to classify two-dimensional inputs restricted to two neurons.
Fig. 5: Networks with mean squared error loss and linear units continue to compress dimensionality.
Fig. 6: Variability of gradients compresses neural representations in networks with injected hidden unit noise and regularization.

Data availability

All data used in the paper are generated by the code at ref. 61.

Code availability

Code for training the networks and generating the plots can be found in a Code Ocean capsule (ref. 61).

Change history

References

  1. Cover, T. M. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput. EC-14, 326–334 (1965).

  2. Fusi, S., Miller, E. K. & Rigotti, M. Why neurons mix: high dimensionality for higher cognition. Curr. Opin. Neurobiol. 37, 66–74 (2016).

  3. Vapnik, V. N. Statistical Learning Theory (Wiley-Interscience, 1998).

  4. Litwin-Kumar, A., Harris, K. D., Axel, R., Sompolinsky, H. & Abbott, L. F. Optimal degrees of synaptic connectivity. Neuron 93, 1153–1164 (2017).

  5. Cayco-Gajic, N. A., Clopath, C. & Silver, R. A. Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nat. Commun. 8, 1116 (2017).

  6. Wallace, C. S. & Boulton, D. M. An information measure for classification. Comput. J. 11, 185–194 (1968).

  7. Rissanen, J. Modeling by shortest data description. Automatica 14, 465–471 (1978).

  8. Bengio, Y., Courville, A. & Vincent, P. Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1798–1828 (2013).

  9. Ansuini, A., Laio, A., Macke, J. H. & Zoccolan, D. Intrinsic dimension of data representations in deep neural networks. Adv. Neural Inf. Process. Syst. 32, 11 (2019).

  10. Recanatesi, S. et al. Dimensionality compression and expansion in deep neural networks. Preprint at https://arxiv.org/abs/1906.00443 (2019).

  11. Cohen, U., Chung, S. Y., Lee, D. D. & Sompolinsky, H. Separability and geometry of object manifolds in deep neural networks. Nat. Commun. 11, 746 (2020).

  12. Jaeger, H. The ‘Echo State’ Approach to Analysing and Training Recurrent Neural Networks—with an Erratum Note. GMD Technical Report 148 (German National Research Center for Information Technology, 2001).

  13. Maass, W., Natschläger, T. & Markram, H. Real-time computing without stable states: a new framework for neural computation based on perturbations. Neural Comput. 14, 2531–2560 (2002).

  14. Legenstein, R. & Maass, W. Edge of chaos and prediction of computational performance for neural circuit models. Neural Netw. 20, 323–334 (2007).

  15. Keup, C., Kühn, T., Dahmen, D. & Helias, M. Transient chaotic dimensionality expansion by recurrent networks. Phys. Rev. X 11, 021064 (2021).

  16. Vreeswijk, C. V. & Sompolinsky, H. Chaotic balanced state in a model of cortical circuits. Neural Comput. 10, 1321–1371 (1998).

  17. Litwin-Kumar, A. & Doiron, B. Slow dynamics and high variability in balanced cortical networks with clustered connections. Nat. Neurosci. 15, 1498–1505 (2012).

  18. Wolf, F., Engelken, R., Puelma-Touzel, M., Weidinger, J. D. F. & Neef, A. Dynamical models of cortical circuits. Curr. Opin. Neurobiol. 25, 228–236 (2014).

  19. Lajoie, G., Lin, K. & Shea-Brown, E. Chaos and reliability in balanced spiking networks with temporal drive. Phys. Rev. E 87, 2432–2437 (2013).

  20. London, M., Roth, A., Beeren, L., Häusser, M. & Latham, P. E. Sensitivity to perturbations in vivo implies high noise and suggests rate coding in cortex. Nature 466, 123–127 (2010).

  21. Stam, C. J. Nonlinear dynamical analysis of EEG and MEG: review of an emerging field. Clin. Neurophysiol. 116, 2266–2301 (2005).

  22. Engelken, R. & Wolf, F. Dimensionality and entropy of spontaneous and evoked rate activity. In APS March Meeting Abstracts, Bull. Am. Phys. Soc. eP5.007 (2017).

  23. Kaplan, J. L. & Yorke, J. A. In Functional Differential Equations and Approximation of Fixed Points: Proceedings, Bonn, July 1978 204–227 (Springer, 1979).

  24. Sussillo, D. & Abbott, L. F. Generating coherent patterns of activity from chaotic neural networks. Neuron 63, 544–557 (2009).

  25. DePasquale, B., Cueva, C. J., Rajan, K., Escola, G. S. & Abbott, L. F. full-FORCE: a target-based method for training recurrent networks. PLoS ONE 13, e0191527 (2018).

  26. Stern, M., Olsen, S., Shea-Brown, E., Oganian, Y. & Manavi, S. In the footsteps of learning: changes in network dynamics and dimensionality with task acquisition. In Proc. COSYNE 2018, abstract no. III-100.

  27. Farrell, M. Revealing Structure in Trained Neural Networks Through Dimensionality-Based Methods. PhD thesis, Univ. Washington (2020).

  28. Rajan, K., Abbott, L. F. & Sompolinsky, H. Stimulus-dependent suppression of chaos in recurrent neural networks. Phys. Rev. E 82, 011903 (2010).

  29. Bell, R. J. & Dean, P. Atomic vibrations in vitreous silica. Discuss. Faraday Soc. 50, 55–61 (1970).

  30. Gao, P., Trautmann, E., Yu, B. & Santhanam, G. A theory of multineuronal dimensionality, dynamics and measurement. Preprint at bioRxiv https://doi.org/10.1101/214262 (2017).

  31. Riesenhuber, M. & Poggio, T. Hierarchical models of object recognition in cortex. Nat. Neurosci. 2, 1019–1025 (1999).

  32. Goodfellow, I., Lee, H., Le, Q. V., Saxe, A. & Ng, A. Y. Measuring invariances in deep networks. Adv. Neural Inf. Process. Syst. 22, 646–654 (2009).

  33. Lajoie, G., Lin, K. K., Thivierge, J.-P. & Shea-Brown, E. Encoding in balanced networks: revisiting spike patterns and chaos in stimulus-driven systems. PLoS Comput. Biol. 12, e1005258 (2016).

  34. Huang, H. Mechanisms of dimensionality reduction and decorrelation in deep neural networks. Phys. Rev. E 98, 062313 (2018).

  35. Kadmon, J. & Sompolinsky, H. Optimal architectures in a solvable model of deep networks. Adv. Neural Inf. Process. Syst. 29, 4781–4789 (2016).

  36. Papyan, V., Han, X. Y. & Donoho, D. L. Prevalence of neural collapse during the terminal phase of deep learning training. Proc. Natl Acad. Sci. USA 117, 24652–24663 (2020).

  37. Shwartz-Ziv, R. & Tishby, N. Opening the black box of deep neural networks via information. Preprint at https://arxiv.org/abs/1703.00810 (2017).

  38. Shwartz-Ziv, R., Painsky, A. & Tishby, N. Representation compression and generalization in deep neural networks. Preprint at OpenReview (2019).

  39. Babadi, B. & Sompolinsky, H. Sparseness and expansion in sensory representations. Neuron 83, 1213–1226 (2014).

  40. Marr, D. A theory of cerebellar cortex. J. Physiol. 202, 437–470.1 (1969).

  41. Albus, J. S. A theory of cerebellar function. Math. Biosci. 10, 25–61 (1971).

  42. Stringer, C., Pachitariu, M., Steinmetz, N., Carandini, M. & Harris, K. D. High-dimensional geometry of population responses in visual cortex. Nature 571, 361–365 (2019).

  43. Mazzucato, L., Fontanini, A. & La Camera, G. Stimuli reduce the dimensionality of cortical activity. Front. Syst. Neurosci. 10, 11 (2016).

  44. Rosenbaum, R., Smith, M. A., Kohn, A., Rubin, J. E. & Doiron, B. The spatial structure of correlated neuronal variability. Nat. Neurosci. 20, 107–114 (2017).

  45. Landau, I. D. & Sompolinsky, H. Coherent chaos in a recurrent neural network with structured connectivity. PLoS Comput. Biol. 14, e1006309 (2018).

  46. Huang, C. et al. Circuit models of low-dimensional shared variability in cortical networks. Neuron 101, 337–348.e4 (2019).

  47. Mastrogiuseppe, F. & Ostojic, S. Linking connectivity, dynamics, and computations in low-rank recurrent neural networks. Neuron 99, 609–623.e29 (2018).

  48. Mazzucato, L., Fontanini, A. & La Camera, G. Dynamics of multistable states during ongoing and evoked cortical activity. J. Neurosci. 35, 8214–8231 (2015).

  49. Cunningham, J. P. & Yu, B. M. Dimensionality reduction for large-scale neural recordings. Nat. Neurosci. 17, 1500–1509 (2014).

  50. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016); http://www.deeplearningbook.org

  51. Faisal, A. A., Selen, L. P. J. & Wolpert, D. M. Noise in the nervous system. Nat. Rev. Neurosci. 9, 292–303 (2008).

  52. Freedman, D. J. & Assad, J. A. Experience-dependent representation of visual categories in parietal cortex. Nature 443, 85–88 (2006).

  53. Dangi, S., Orsborn, A. L., Moorman, H. G. & Carmena, J. M. Design and analysis of closed-loop decoder adaptation algorithms for brain–machine interfaces. Neural Comput. 25, 1693–1731 (2013).

  54. Orsborn, A. L. & Pesaran, B. Parsing learning in networks using brain–machine interfaces. Curr. Opin. Neurobiol. 46, 76–83 (2017).

  55. Recanatesi, S. et al. Predictive learning as a network mechanism for extracting low-dimensional latent space representations. Nat. Commun. 12, 1417 (2021).

  56. Banino, A. et al. Vector-based navigation using grid-like representations in artificial agents. Nature 557, 429 (2018).

  57. Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M. & Tang, P. T. P. On large-batch training for deep learning: generalization gap and sharp minima. In 5th International Conference on Learning Representations https://doi.org/10.48550/arXiv.1609.04836 (2017).

  58. Advani, M. S., Saxe, A. M. & Sompolinsky, H. High-dimensional dynamics of generalization error in neural networks. Neural Netw. 132, 428–446 (2020).

  59. Li, Y. & Liang, Y. Learning overparameterized neural networks via stochastic gradient descent on structured data. Adv. Neural Inf. Process. Syst. 31 (2018).

  60. Lipton, Z. C., Berkowitz, J. & Elkan, C. A critical review of recurrent neural networks for sequence learning. Preprint at https://arxiv.org/abs/1506.00019 (2015).

  61. Farrell, M. Gradient-based learning drives robust representations in RNNs by balancing compression and expansion. Code Ocean https://doi.org/10.24433/CO.5101546.v1 (2022).

Acknowledgements

M.F. was funded by the National Science Foundation Graduate Research Fellowship under Grant DGE-1256082. G.L. is funded by an NSERC Discovery Grant (RGPIN-2018-04821), an FRQNT Young Investigator Startup Program (2019-NC-253251) and an FRQS Research Scholar Award, Junior 1 (LAJGU0401-253188). E.S.-B. acknowledges the support of NSF DMS Grant 1514743. M.F. thanks the Swartz Program in Theoretical Neuroscience at Harvard and S.R. thanks the Swartz Center for Theoretical Neuroscience at the University of Washington for support. We thank M. Stern, D. Chklovskii, A. Weber, N. Steinmetz and L. Mazzucato for their insights and suggestions. M.F. would also like to thank H. Sompolinsky and S. Chung for their mentorship and inspiration.

Author information

Authors and Affiliations

Authors

Contributions

M.F., S.R. and E.S.-B. conceived the study. M.F. wrote code and ran simulations with some guidance from S.R. The manuscript was primarily written by M.F., with substantial edits and contributions made by S.R., G.L. and E.S.-B. G.L. contributed code for computing Lyapunov exponents and provided additional insight. T.M. ran the simulations for Extended Data Fig. 9 and ran additional verification experiments for intermediate values of β.

Corresponding author

Correspondence to Matthew Farrell.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Cristina Savin and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Effects of changing the evaluation timestep and number of recurrent units.

Strongly chaotic (red) and edge-of-chaos (cyan) networks are trained to classify high-dimensional inputs. Details are as in Fig. 2e. Shaded regions are as defined in Fig. 2e. First row: network trained with a categorical cross-entropy loss with a learning rate of 1e-4. Second row: network trained with a mean squared error loss with a learning rate of 1e-3. First column: evaluation time is t = 6. Second column: evaluation time is t = 10. Third column: evaluation time is t = 14. Fourth column: Number of hidden neurons is increased to N = 300. Evaluation time is t = 14.

Extended Data Fig. 2 Effects of changing the evaluation timestep, input dimension, and number of recurrent units.

Strongly chaotic (red) and edge-of-chaos (cyan) networks are trained to classify low-dimensional inputs. Dashed lines depict the network before training and solid lines the network after training. All networks are trained with a categorical cross-entropy loss and a learning rate of 1e-4 (note that this is a factor of 10 less than used in the main text). Shaded regions are as defined in Fig. 2e. Other details are as in Fig. 3e. First row: 2-dimensional inputs. Second row: 4-dimensional inputs. Third row: 10-dimensional inputs. First column: evaluation time is t = 6. Second column: evaluation time is t = 10. Third column: evaluation time is t = 14. Fourth column: number of hidden neurons is increased to N = 300. Evaluation time is t = 14.

Extended Data Fig. 3 Effects of changing the evaluation timestep, input dimension, and number of recurrent units on logistic regression testing accuracy.

Strongly chaotic (red) and edge-of-chaos (cyan) networks are trained to classify low-dimensional inputs. Dashed lines depict the network before training and solid lines the network after training. Details are as in Extended Data Fig. 2, but here we measure the logistic regression testing accuracy as defined in the main text.
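
The logistic regression testing accuracy reported in these panels can be computed along the following lines. This is a minimal sketch rather than the authors' code: the data arrays are hypothetical stand-ins, and the solver and regularization settings used in the paper may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def logistic_readout_accuracy(hidden_states, labels, seed=0):
    """Fit a linear readout on hidden states and return held-out accuracy.

    hidden_states: (n_trials, n_units) array, e.g. RNN activity at the
        evaluation timestep.
    labels: (n_trials,) array of integer class labels.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        hidden_states, labels, test_size=0.2, random_state=seed)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)  # fraction of held-out trials correct

# Example with random stand-in data (200 units, 60 class labels):
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 200))
y = rng.integers(0, 60, size=600)
print(logistic_readout_accuracy(X, y))
```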

Extended Data Fig. 4 Effects of changing the evaluation timestep, input dimension, and number of recurrent units with 120 input clusters.

Strongly chaotic (red) and edge-of-chaos (cyan) networks are trained to classify low-dimensional inputs. Dashed lines depict the network before training and solid lines the network after training. Details are as in Extended Data Fig. 2, but here 120 input clusters are used instead of 60.

Extended Data Fig. 5 Between-class distances are increased while within-class distances are diminished by the network dynamics.

Mean pairwise distance between points belonging to the same class (dashed lines), mean pairwise distance between points belonging to different classes (dotted lines), and the ratio of the first to the second (blue lines and axes), for the representations of trained networks over time t. Details are as in Fig. 2e in the main text. Shaded regions are as defined in Fig. 2e. a. Edge-of-chaos network as defined in the main text. b. Strongly chaotic network as defined in the main text.
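
The within- and between-class distance statistics plotted here can be computed directly from the stored representations at each timestep. A minimal sketch, assuming the representations are held in a points-by-units array (names and interface are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def class_distance_stats(X, labels):
    """Mean pairwise distances within and between classes, plus their ratio.

    X: (n_points, n_units) network representations at one timestep.
    labels: (n_points,) integer class labels.
    """
    classes = np.unique(labels)
    # All pairwise distances between points sharing a class label.
    within = np.concatenate(
        [pdist(X[labels == c]) for c in classes if (labels == c).sum() > 1])
    # All pairwise distances between points from different classes.
    between = np.concatenate(
        [cdist(X[labels == a], X[labels == b]).ravel()
         for i, a in enumerate(classes) for b in classes[i + 1:]])
    # A ratio below 1 indicates clustering by class, as seen after training.
    return within.mean(), between.mean(), within.mean() / between.mean()
```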

Extended Data Fig. 6 Dependence of dimensionality on the learning rate.

Here we reproduce the results of Figs. 2e and 3e of the main text, using a different learning rate. Red lines correspond to strongly chaotic and cyan lines to edge-of-chaos networks. Dashed and solid lines depict before and after training, respectively. Shaded regions are as defined in Fig. 2e. Top row: high-dimensional inputs as in Fig. 2e. Bottom row: low-dimensional inputs as in Fig. 3e. Left column: learning rate of 1e-4. Right column: learning rate of 1e-3, as in the main text.

Extended Data Fig. 7 Dimensionality increases with number of class labels, but not number of clusters.

Effective dimensionalities (EDs) of the trained network responses to inputs embedded in an N-dimensional space, measured at the evaluation time teval = 10. Error bars denote two standard deviations across three initializations of the task and networks (in all panels they are too small to see). Details are similar to Fig. 2. a. Edge-of-chaos networks. Blue: ED of the inputs. Green: ED of the network representation as a function of the number of input clusters. Dimensionality remains flat and small. b. Edge-of-chaos networks. Green: ED of the network representation as a function of the number of class labels. Black: effective dimensionality of points distributed uniformly at random in an N-dimensional ball, with the number of points drawn equal to the number of class labels. This gives a rough estimate of the ED the network would have if it formed a fixed point for every class label and distributed these fixed points randomly in space. c. Strongly chaotic networks. Legend as in a. d. Strongly chaotic networks. Legend as in b.
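
The effective dimensionality used throughout is, we assume, the participation ratio of the eigenvalues of the response covariance; the paper's exact estimator may differ in detail. A minimal sketch, including the uniform-ball baseline corresponding to the black curves:

```python
import numpy as np

def effective_dimensionality(X):
    """Participation ratio ED = (sum_i l_i)^2 / sum_i l_i^2, where l_i are
    the eigenvalues of the covariance of X, shape (n_points, n_dims).
    ED equals N for isotropic N-dimensional data and approaches 1 when
    variance concentrates along a single direction."""
    lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    return lam.sum() ** 2 / (lam ** 2).sum()

def random_points_in_ball(n_points, n_dims, rng):
    """Points drawn uniformly at random from the unit N-dimensional ball."""
    g = rng.normal(size=(n_points, n_dims))
    g /= np.linalg.norm(g, axis=1, keepdims=True)          # uniform directions
    r = rng.uniform(size=(n_points, 1)) ** (1.0 / n_dims)  # radii for uniform ball
    return g * r

# Baseline ED for as many random points as there are class labels:
rng = np.random.default_rng(0)
print(effective_dimensionality(random_points_in_ball(60, 200, rng)))
```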

Extended Data Fig. 8 Example of noise in output weights driving compression of the hidden representation in a linear network with two hidden layer units.

The equation for the network is \(\mathbf{h} = W\mathbf{x} + \mathbf{b}\) with output \(\hat{o} = \mathbf{r}^T\mathbf{h}\). The input weights (red) are initialized to the 2 × 2 identity matrix, and the bias is initialized as (1, 0). The inputs are placed on a grid from x = −1 to x = 2 and from y = −3 to y = 3 (not shown). Network output \(\hat{o}\) is trained to minimize the squared error loss \(0.5(\hat{o}-1)^2\). Input samples are chosen randomly, and input weights are updated via stochastic gradient descent with batch size 1. a. Top: diagram of the network where input weights are trained and output weights are fixed. Bottom: diagram of the network where input weights are trained and output weights are drawn from a normal distribution with mean (1, 0) and covariance 0.05I at every update step. In the figure, η represents additive white noise. Middle: hidden unit responses (blue circles) to the inputs before training (iteration 0). The black dot denotes the output weight vector, and the blue line is the affine subspace of points that \(\mathbf{r}\) maps to 1. b. Evolution of the hidden layer response to inputs (representation) as input weights are trained. Top: representation of the network where output weights are fixed. The iteration number denotes the number of training samples that have been used to update the weights. Activations compress to the space orthogonal to \(\mathbf{r}\), shifted by (1, 0). Bottom: representation of the network where output weights are randomly drawn at every input sample presentation. Activations compress to a compact, localized space. The direction of compression is both along and orthogonal to \(\mathbf{r}\).
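
The toy model in this caption is fully specified and straightforward to reproduce. Below is a minimal sketch; the learning rate, number of steps, and grid resolution are our own assumptions, as the caption does not state them.

```python
import numpy as np

rng = np.random.default_rng(0)

# 2D inputs on a grid, x in [-1, 2] and y in [-3, 3], as in the caption.
gx, gy = np.meshgrid(np.linspace(-1, 2, 10), np.linspace(-3, 3, 10))
inputs = np.stack([gx.ravel(), gy.ravel()], axis=1)

def train(noisy_readout, n_steps=5000, lr=0.01, noise_var=0.05):
    W = np.eye(2)               # input weights: 2 x 2 identity
    b = np.array([1.0, 0.0])    # bias: (1, 0)
    r_mean = np.array([1.0, 0.0])
    for _ in range(n_steps):
        x = inputs[rng.integers(len(inputs))]   # one random sample (batch 1)
        r = (r_mean + np.sqrt(noise_var) * rng.normal(size=2)
             if noisy_readout else r_mean)      # resample output weights
        err = r @ (W @ x + b) - 1.0             # derivative of 0.5*(o - 1)^2
        W -= lr * np.outer(err * r, x)          # SGD step on input weights
        b -= lr * err * r
    return inputs @ W.T + b                     # hidden responses h = Wx + b

H_fixed = train(noisy_readout=False)
H_noisy = train(noisy_readout=True)
# Fixed r leaves variance orthogonal to r intact; resampled r compresses
# the representation in all directions, as in the bottom row of panel b.
print(H_fixed.var(axis=0), H_noisy.var(axis=0))
```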

Extended Data Fig. 9 Representations of RNNs trained on the MNIST digit recognition dataset.

a. Effective dimensionality (ED) of RNNs trained on the MNIST digit recognition dataset. The ED of the network’s responses to test inputs is plotted. After training, dimensionality compresses down to a value by t = 10 that roughly matches the number of class labels (10). This compression is similar to that seen in Fig. 2 of the main text. Details are as in Fig. 2e of the main text. Shaded regions are as defined in Fig. 2e. b. Projection onto the top three principal components of MNIST test data. Colours indicate true class label (i.e., digit identity). c. Projection onto the top three principal components of the edge-of-chaos recurrent network’s responses to the inputs in b after training, at the evaluation time t = 10. Colours indicate true class label as in b. The network forms a localized cluster for each digit.
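
The projections in panels b and c can be produced with plain SVD-based principal component analysis. A minimal sketch (variable names are illustrative; the actual plotting code is in the Code Ocean capsule, ref. 61):

```python
import numpy as np

def top3_pca_projection(responses):
    """Project data onto its top three principal components.

    responses: (n_samples, n_features) array, e.g. flattened MNIST test
    images (panel b) or RNN hidden states at t = 10 (panel c).
    """
    centered = responses - responses.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:3].T  # (n_samples, 3) coordinates to plot

# Shape check with random stand-in data:
proj = top3_pca_projection(np.random.default_rng(0).normal(size=(500, 64)))
print(proj.shape)  # (500, 3)
```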

Extended Data Fig. 10 Effects of changing the initial coupling strength β.

Details are as in Fig. 3e, except that here we vary the coupling parameter β, whose value is indicated by the colourbar to the right. Shaded regions are as defined in Fig. 2e. First column: ED of networks before training. Second column: ED of networks after training with a learning rate of 1e-3. Third column: ED of networks after training with a learning rate of 1e-4.

Supplementary information

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Farrell, M., Recanatesi, S., Moore, T. et al. Gradient-based learning drives robust representations in recurrent neural networks by balancing compression and expansion. Nat Mach Intell 4, 564–573 (2022). https://doi.org/10.1038/s42256-022-00498-0

